Python Word Sense Disambiguation

(C) 2017-2019 by Damir Cavar

Version: 1.2, November 2019

This tutorial relates to the discussion of Word Sense Disambiguation and of the various machine learning strategies covered in the textbook Machine Learning: The Art and Science of Algorithms that Make Sense of Data by Peter Flach.

This tutorial was developed as part of my course material for the courses Machine Learning and Advanced Natural Language Processing at Indiana University.

Word Sense Disambiguation

For a simple Bayesian implementation of a Word Sense Disambiguation algorithm we will use the WordNet interface that ships with NLTK. We import it in the following way:


In [1]:
from nltk.corpus import wordnet

For a word that we want to disambiguate, we need to get all its synsets:


In [2]:
mySynsets = wordnet.synsets('bank')
print(mySynsets)


[Synset('bank.n.01'), Synset('depository_financial_institution.n.01'), Synset('bank.n.03'), Synset('bank.n.04'), Synset('bank.n.05'), Synset('bank.n.06'), Synset('bank.n.07'), Synset('savings_bank.n.02'), Synset('bank.n.09'), Synset('bank.n.10'), Synset('bank.v.01'), Synset('bank.v.02'), Synset('bank.v.03'), Synset('bank.v.04'), Synset('bank.v.05'), Synset('deposit.v.02'), Synset('bank.v.07'), Synset('trust.v.01')]

For each synset we need its definition and its examples, which we join into one text and use as a bag of words for comparison:


In [3]:
for s in mySynsets:
    print(s.name())
    text = " ".join( [s.definition()] + s.examples() )
    print(text, "\n", "-" * 20)


bank.n.01
sloping land (especially the slope beside a body of water) they pulled the canoe up on the bank he sat on the bank of the river and watched the currents 
 --------------------
depository_financial_institution.n.01
a financial institution that accepts deposits and channels the money into lending activities he cashed a check at the bank that bank holds the mortgage on my home 
 --------------------
bank.n.03
a long ridge or pile a huge bank of earth 
 --------------------
bank.n.04
an arrangement of similar objects in a row or in tiers he operated a bank of switches 
 --------------------
bank.n.05
a supply or stock held in reserve for future use (especially in emergencies) 
 --------------------
bank.n.06
the funds held by a gambling house or the dealer in some gambling games he tried to break the bank at Monte Carlo 
 --------------------
bank.n.07
a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal force 
 --------------------
savings_bank.n.02
a container (usually with a slot in the top) for keeping money at home the coin bank was empty 
 --------------------
bank.n.09
a building in which the business of banking transacted the bank is on the corner of Nassau and Witherspoon 
 --------------------
bank.n.10
a flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turning) the plane went into a steep bank 
 --------------------
bank.v.01
tip laterally the pilot had to bank the aircraft 
 --------------------
bank.v.02
enclose with a bank bank roads 
 --------------------
bank.v.03
do business with a bank or keep an account at a bank Where do you bank in this town? 
 --------------------
bank.v.04
act as the banker in a game or in gambling 
 --------------------
bank.v.05
be in the banking business 
 --------------------
deposit.v.02
put into a bank account She deposits her paycheck every month 
 --------------------
bank.v.07
cover with ashes so to control the rate of burning bank a fire 
 --------------------
trust.v.01
have confidence or faith in We can trust in God Rely on your friends bank on your good education I swear by my grandmother's recipes 
 --------------------
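
The comparison itself is not shown above. As a minimal sketch of what a bag-of-words comparison could look like, the following block counts the words that a context sentence shares with each gloss text. This is a simplified Lesk-style overlap, not the Bayesian method this tutorial is aiming at. The gloss texts are copied from the output above; the stopword set `STOP`, the helper names `bag` and `overlap_scores`, and the context sentence are assumptions made for illustration.

```python
import re

# Gloss texts (definition plus examples) copied from the output above.
glosses = {
    "bank.n.01": (
        "sloping land (especially the slope beside a body of water) "
        "they pulled the canoe up on the bank "
        "he sat on the bank of the river and watched the currents"
    ),
    "depository_financial_institution.n.01": (
        "a financial institution that accepts deposits and channels the money "
        "into lending activities he cashed a check at the bank "
        "that bank holds the mortgage on my home"
    ),
    "savings_bank.n.02": (
        "a container (usually with a slot in the top) for keeping money at home "
        "the coin bank was empty"
    ),
}

# Tiny hand-picked stopword set (an assumption for this sketch only).
STOP = {"a", "an", "the", "of", "on", "in", "at", "to", "and", "that",
        "he", "they", "my", "i", "for", "with", "was", "up", "into"}


def bag(text, keyword):
    """Return the set of lowercased content words, without the keyword itself."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOP and w != keyword}


def overlap_scores(context, keyword):
    """Count how many content words the context shares with each gloss."""
    ctx = bag(context, keyword)
    return {name: len(ctx & bag(text, keyword)) for name, text in glosses.items()}


scores = overlap_scores("I took my money to the bank to pay the mortgage", "bank")
best = max(scores, key=scores.get)
print(scores)
print(best)  # depository_financial_institution.n.01
```

The keyword is excluded from both bags because it occurs in almost every example sentence and would not help to separate the senses.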

We will need to join a list of lists into one list, that is, we need to flatten a list of lists. To achieve this, we can use the following code:


In [4]:
import itertools
lOfl = [["this"], ["is","a"], ["test"]]
print(list(itertools.chain.from_iterable(lOfl)))


['this', 'is', 'a', 'test']

Next we tokenize and part-of-speech tag the text, that is, the definitions and the examples. We can use NLTK's word_tokenize and pos_tag functions:


In [5]:
from nltk import word_tokenize, pos_tag

Now we can tokenize and PoS-tag the texts, and then remove the stopwords:


In [6]:
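from nltk.corpus import stopwords
stopw = stopwords.words("english")  # the English stopword list is lowercase

for s in mySynsets:
    print(s.name())
    # tag the definition and each example separately, then merge the results
    text = pos_tag(word_tokenize(s.definition()))
    text += list(itertools.chain.from_iterable([ pos_tag(word_tokenize(x)) for x in s.examples() ]))
    # drop stopwords; the match is case-sensitive, so tokens like 'She' are kept
    text2 = [ x for x in text if x[0] not in stopw ]
    print(text2, "\n", "-" * 20)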


bank.n.01
[('sloping', 'VBG'), ('land', 'NN'), ('(', '('), ('especially', 'RB'), ('slope', 'NN'), ('beside', 'IN'), ('body', 'NN'), ('water', 'NN'), (')', ')'), ('pulled', 'VBD'), ('canoe', 'NN'), ('bank', 'NN'), ('sat', 'VBD'), ('bank', 'NN'), ('river', 'NN'), ('watched', 'VBD'), ('currents', 'NNS')] 
 --------------------
depository_financial_institution.n.01
[('financial', 'JJ'), ('institution', 'NN'), ('accepts', 'VBZ'), ('deposits', 'NNS'), ('channels', 'NNS'), ('money', 'NN'), ('lending', 'NN'), ('activities', 'NNS'), ('cashed', 'VBD'), ('check', 'NN'), ('bank', 'NN'), ('bank', 'NN'), ('holds', 'VBZ'), ('mortgage', 'NN'), ('home', 'NN')] 
 --------------------
bank.n.03
[('long', 'JJ'), ('ridge', 'NN'), ('pile', 'NN'), ('huge', 'JJ'), ('bank', 'NN'), ('earth', 'NN')] 
 --------------------
bank.n.04
[('arrangement', 'NN'), ('similar', 'JJ'), ('objects', 'NNS'), ('row', 'NN'), ('tiers', 'NNS'), ('operated', 'VBD'), ('bank', 'NN'), ('switches', 'NNS')] 
 --------------------
bank.n.05
[('supply', 'NN'), ('stock', 'NN'), ('held', 'VBN'), ('reserve', 'NN'), ('future', 'JJ'), ('use', 'NN'), ('(', '('), ('especially', 'RB'), ('emergencies', 'NNS'), (')', ')')] 
 --------------------
bank.n.06
[('funds', 'NNS'), ('held', 'VBN'), ('gambling', 'NN'), ('house', 'NN'), ('dealer', 'NN'), ('gambling', 'NN'), ('games', 'NNS'), ('tried', 'VBD'), ('break', 'VB'), ('bank', 'NN'), ('Monte', 'NNP'), ('Carlo', 'NNP')] 
 --------------------
bank.n.07
[('slope', 'NN'), ('turn', 'NN'), ('road', 'NN'), ('track', 'NN'), (';', ':'), ('outside', 'NN'), ('higher', 'JJR'), ('inside', 'NN'), ('order', 'NN'), ('reduce', 'VB'), ('effects', 'NNS'), ('centrifugal', 'JJ'), ('force', 'NN')] 
 --------------------
savings_bank.n.02
[('container', 'NN'), ('(', '('), ('usually', 'RB'), ('slot', 'NN'), ('top', 'NN'), (')', ')'), ('keeping', 'VBG'), ('money', 'NN'), ('home', 'NN'), ('coin', 'NN'), ('bank', 'NN'), ('empty', 'JJ')] 
 --------------------
bank.n.09
[('building', 'NN'), ('business', 'NN'), ('banking', 'NN'), ('transacted', 'VBN'), ('bank', 'NN'), ('corner', 'NN'), ('Nassau', 'NNP'), ('Witherspoon', 'NNP')] 
 --------------------
bank.n.10
[('flight', 'NN'), ('maneuver', 'NN'), (';', ':'), ('aircraft', 'CC'), ('tips', 'NNS'), ('laterally', 'RB'), ('longitudinal', 'JJ'), ('axis', 'NN'), ('(', '('), ('especially', 'RB'), ('turning', 'VBG'), (')', ')'), ('plane', 'NN'), ('went', 'VBD'), ('steep', 'JJ'), ('bank', 'NN')] 
 --------------------
bank.v.01
[('tip', 'NN'), ('laterally', 'RB'), ('pilot', 'NN'), ('bank', 'NN'), ('aircraft', 'NN')] 
 --------------------
bank.v.02
[('enclose', 'RB'), ('bank', 'NN'), ('bank', 'NN'), ('roads', 'NNS')] 
 --------------------
bank.v.03
[('business', 'NN'), ('bank', 'NN'), ('keep', 'VB'), ('account', 'NN'), ('bank', 'NN'), ('Where', 'WRB'), ('bank', 'NN'), ('town', 'NN'), ('?', '.')] 
 --------------------
bank.v.04
[('act', 'NN'), ('banker', 'NN'), ('game', 'NN'), ('gambling', 'VBG')] 
 --------------------
bank.v.05
[('banking', 'NN'), ('business', 'NN')] 
 --------------------
deposit.v.02
[('put', 'VBN'), ('bank', 'NN'), ('account', 'NN'), ('She', 'PRP'), ('deposits', 'VBZ'), ('paycheck', 'NN'), ('every', 'DT'), ('month', 'NN')] 
 --------------------
bank.v.07
[('cover', 'NN'), ('ashes', 'NNS'), ('control', 'VB'), ('rate', 'NN'), ('burning', 'NN'), ('bank', 'NN'), ('fire', 'NN')] 
 --------------------
trust.v.01
[('confidence', 'NN'), ('faith', 'NN'), ('We', 'PRP'), ('trust', 'VB'), ('God', 'NNP'), ('Rely', 'RB'), ('friends', 'NNS'), ('bank', 'NN'), ('good', 'JJ'), ('education', 'NN'), ('I', 'PRP'), ('swear', 'VBP'), ('grandmother', 'NN'), ("'s", 'POS'), ('recipes', 'NNS')] 
 --------------------
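
In the output above, capitalized words such as 'She', 'We', 'I', and 'Where' survive the filter because the stopword list is lowercase and the comparison is case-sensitive; punctuation tokens such as '(' and '?' survive as well. A minimal sketch of a stricter filter follows. The tagged tokens are copied from the bank.v.03 output above; the small `stop` set is a stand-in for the NLTK list, and the helper name `keep` is an assumption.

```python
# Tagged tokens copied from the bank.v.03 output above.
tagged = [('business', 'NN'), ('bank', 'NN'), ('keep', 'VB'), ('account', 'NN'),
          ('bank', 'NN'), ('Where', 'WRB'), ('bank', 'NN'), ('town', 'NN'), ('?', '.')]

# Stand-in stopword set for this sketch; with NLTK available one could use
# set(stopwords.words("english")) instead.
stop = {"where", "she", "we", "i"}


def keep(token):
    """Keep a token if it is not a stopword (ignoring case) and not pure punctuation."""
    return token.lower() not in stop and any(ch.isalnum() for ch in token)


filtered = [(word, tag) for word, tag in tagged if keep(word)]
print(filtered)
```

Lowercasing only for the comparison keeps the original spelling in the result, so proper nouns such as 'Nassau' remain recognizable.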

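To match inflected forms such as "dogs" against a keyword such as "dog", we can lemmatize the tokens with NLTK's WordNetLemmatizer:


In [7]: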
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

wordnet_lemmatizer.lemmatize('dogs')


Out[7]:
'dog'

Given a text that contains the word we want to disambiguate, the first step is to find the position of that word in the list of lemmatized tokens:


In [8]:
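example = "John saw the dogs barking at the cats."
keyword = "dog"
tokens = word_tokenize(example)
# lemmatize every token so that 'dogs' matches the keyword 'dog'
lemmas = [ wordnet_lemmatizer.lemmatize(x) for x in tokens ]
pos = -1  # stays -1 if the keyword does not occur

try:
    pos = lemmas.index(keyword)  # index of the first occurrence
except ValueError:
    pass

print("Position:", pos)
print(lemmas)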


Position: 3
['John', 'saw', 'the', 'dog', 'barking', 'at', 'the', 'cat', '.']
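
The lookup above returns only the first occurrence, and a sentinel of -1 is also a valid Python index (the last element), so a missing keyword can go unnoticed in later indexing. As a hedged alternative, the following sketch collects all positions and returns an empty list when the keyword is absent. The lemma list is copied from the output above; the function name `find_positions` is an assumption.

```python
# Lemma list copied from the output above.
lemmas = ['John', 'saw', 'the', 'dog', 'barking', 'at', 'the', 'cat', '.']


def find_positions(lemmas, keyword):
    """Return the indices of all lemmas equal to the keyword, ignoring case."""
    keyword = keyword.lower()
    return [i for i, lemma in enumerate(lemmas) if lemma.lower() == keyword]


print(find_positions(lemmas, "dog"))   # [3]
print(find_positions(lemmas, "the"))   # [2, 6]
print(find_positions(lemmas, "bank"))  # []
```

An empty list can be tested directly with `if not positions:` before any token is accessed.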

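Next we PoS-tag the tokens and look up the tag of the keyword. The first letter of the Penn Treebank tag gives the coarse category:


In [9]: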
posTokens = pos_tag(tokens)

print("Lemma:", lemmas[pos])
print("  PoS:", posTokens[pos])
print("  Tag:", posTokens[pos][1])
print(" MTag:", posTokens[pos][1][0])


Lemma: dog
  PoS: ('dogs', 'NNS')
  Tag: NNS
 MTag: N

In [10]:
category = posTokens[pos][1][0]

print(category)


N

We map this coarse category to the corresponding WordNet part-of-speech constant:


In [11]:
wType = None
if category == 'N':
    wType = wordnet.NOUN
elif category == 'V':
    wType = wordnet.VERB
elif category == 'J':
    wType = wordnet.ADJ
elif category == 'R':
    wType = wordnet.ADV

print("Type:", wType)


Type: n
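
The same mapping can be written as a dictionary lookup, which is easier to extend than an if/elif chain. The sketch below uses the literal values 'n', 'v', 'a', and 'r', which are the values of wordnet.NOUN, wordnet.VERB, wordnet.ADJ, and wordnet.ADV, so that it runs without NLTK. The names `PENN_TO_WORDNET` and `wordnet_pos` are assumptions.

```python
# First letter of a Penn Treebank tag -> WordNet part-of-speech value.
# The values equal wordnet.NOUN, wordnet.VERB, wordnet.ADJ, and wordnet.ADV.
PENN_TO_WORDNET = {"N": "n", "V": "v", "J": "a", "R": "r"}


def wordnet_pos(penn_tag):
    """Map a Penn Treebank tag to a WordNet PoS value, or None if there is none."""
    return PENN_TO_WORDNET.get(penn_tag[:1])


print(wordnet_pos("NNS"))  # n
print(wordnet_pos("VBD"))  # v
print(wordnet_pos("JJ"))   # a
print(wordnet_pos("DT"))   # None
```

A result of None can be passed on as is: wordnet.synsets(keyword, pos=None) returns the synsets for all parts of speech.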

With the part of speech known, we can restrict the synsets of the keyword to that category:


In [12]:
wordnet.synsets(keyword, pos=wType)


Out[12]:
[Synset('dog.n.01'),
 Synset('frump.n.01'),
 Synset('dog.n.03'),
 Synset('cad.n.01'),
 Synset('frank.n.02'),
 Synset('pawl.n.01'),
 Synset('andiron.n.01')]

As before, we tokenize and PoS-tag the definition and the examples of each remaining synset:


In [13]:
for s in wordnet.synsets(keyword, pos=wType):
    print(s.name())
    text = pos_tag(word_tokenize(s.definition()))
    text += list(itertools.chain.from_iterable([ pos_tag(word_tokenize(x)) for x in s.examples() ]))
    print(text, "\n", "-" * 20)


dog.n.01
[('a', 'DT'), ('member', 'NN'), ('of', 'IN'), ('the', 'DT'), ('genus', 'NN'), ('Canis', 'NNP'), ('(', '('), ('probably', 'RB'), ('descended', 'VBN'), ('from', 'IN'), ('the', 'DT'), ('common', 'JJ'), ('wolf', 'NN'), (')', ')'), ('that', 'WDT'), ('has', 'VBZ'), ('been', 'VBN'), ('domesticated', 'VBN'), ('by', 'IN'), ('man', 'NN'), ('since', 'IN'), ('prehistoric', 'JJ'), ('times', 'NNS'), (';', ':'), ('occurs', 'VBZ'), ('in', 'IN'), ('many', 'JJ'), ('breeds', 'NNS'), ('the', 'DT'), ('dog', 'NN'), ('barked', 'VBD'), ('all', 'DT'), ('night', 'NN')] 
 --------------------
frump.n.01
[('a', 'DT'), ('dull', 'JJ'), ('unattractive', 'JJ'), ('unpleasant', 'JJ'), ('girl', 'NN'), ('or', 'CC'), ('woman', 'NN'), ('she', 'PRP'), ('got', 'VBD'), ('a', 'DT'), ('reputation', 'NN'), ('as', 'IN'), ('a', 'DT'), ('frump', 'NN'), ('she', 'PRP'), ("'s", 'VBZ'), ('a', 'DT'), ('real', 'JJ'), ('dog', 'NN')] 
 --------------------
dog.n.03
[('informal', 'JJ'), ('term', 'NN'), ('for', 'IN'), ('a', 'DT'), ('man', 'NN'), ('you', 'PRP'), ('lucky', 'VBP'), ('dog', 'VB')] 
 --------------------
cad.n.01
[('someone', 'NN'), ('who', 'WP'), ('is', 'VBZ'), ('morally', 'RB'), ('reprehensible', 'JJ'), ('you', 'PRP'), ('dirty', 'VBP'), ('dog', 'VB')] 
 --------------------
frank.n.02
[('a', 'DT'), ('smooth-textured', 'JJ'), ('sausage', 'NN'), ('of', 'IN'), ('minced', 'JJ'), ('beef', 'NN'), ('or', 'CC'), ('pork', 'NN'), ('usually', 'RB'), ('smoked', 'VBD'), (';', ':'), ('often', 'RB'), ('served', 'VBD'), ('on', 'IN'), ('a', 'DT'), ('bread', 'NN'), ('roll', 'NN')] 
 --------------------
pawl.n.01
[('a', 'DT'), ('hinged', 'JJ'), ('catch', 'NN'), ('that', 'IN'), ('fits', 'VBZ'), ('into', 'IN'), ('a', 'DT'), ('notch', 'NN'), ('of', 'IN'), ('a', 'DT'), ('ratchet', 'NN'), ('to', 'TO'), ('move', 'VB'), ('a', 'DT'), ('wheel', 'NN'), ('forward', 'RB'), ('or', 'CC'), ('prevent', 'VB'), ('it', 'PRP'), ('from', 'IN'), ('moving', 'VBG'), ('backward', 'NN')] 
 --------------------
andiron.n.01
[('metal', 'NN'), ('supports', 'NNS'), ('for', 'IN'), ('logs', 'NNS'), ('in', 'IN'), ('a', 'DT'), ('fireplace', 'NN'), ('the', 'DT'), ('andirons', 'NNS'), ('were', 'VBD'), ('too', 'RB'), ('hot', 'JJ'), ('to', 'TO'), ('touch', 'VB')] 
 --------------------
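
The steps above prepare the data but stop before the Bayesian scoring. As one possible continuation, the following sketch treats each gloss text as the only training data for its sense and scores a context with a naive Bayes model. The uniform prior, the add-one smoothing, the choice to skip words outside the gloss vocabulary, and the names `tokenize`, `log_score`, and `disambiguate` are all assumptions of this sketch, not the method of the course material. The gloss texts are reconstructed from the tagged output above.

```python
import math
import re
from collections import Counter

# Gloss texts (definition plus examples) reconstructed from the output above.
glosses = {
    "dog.n.01": (
        "a member of the genus Canis (probably descended from the common wolf) "
        "that has been domesticated by man since prehistoric times; "
        "occurs in many breeds the dog barked all night"
    ),
    "frank.n.02": (
        "a smooth-textured sausage of minced beef or pork usually smoked; "
        "often served on a bread roll"
    ),
    "andiron.n.01": (
        "metal supports for logs in a fireplace "
        "the andirons were too hot to touch"
    ),
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


# One bag of words (word -> count) per sense, and the shared vocabulary.
counts = {sense: Counter(tokenize(text)) for sense, text in glosses.items()}
vocab = set().union(*counts.values())


def log_score(sense, words):
    """log P(sense) + sum of log P(word | sense), with add-one smoothing."""
    bag = counts[sense]
    total = sum(bag.values())
    score = math.log(1.0 / len(counts))  # uniform prior over the senses
    for w in words:
        if w in vocab:  # words never seen in any gloss are skipped
            score += math.log((bag[w] + 1) / (total + len(vocab)))
    return score


def disambiguate(context):
    """Return the sense whose gloss makes the context most probable."""
    words = tokenize(context)
    return max(counts, key=lambda sense: log_score(sense, words))


print(disambiguate("he ate a sausage on a bread roll"))  # frank.n.02
print(disambiguate("the dog barked all night"))          # dog.n.01
```

With only one short gloss per sense the estimates are very sparse, so this is best read as an illustration of the scoring scheme rather than a usable classifier.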
